Abstract

We consider Bayesian shrinkage predictions for the Normal regression problem 
under the frequentist Kullback-Leibler risk function. 

Firstly, we consider the multivariate Normal model with an unknown mean and
a known covariance. While the unknown mean is fixed, the covariance of future
samples can be different from that of the training samples. We show that the Bayesian
predictive distribution based on the uniform prior is dominated by that based on a class
of priors if the prior distributions for the covariance and future covariance matrices
are rotation invariant.

Then, we consider a class of priors for the mean parameters depending on the 
future covariance matrix. With such a prior, we can construct a Bayesian predictive 
distribution dominating that based on the uniform prior. 

Lastly, applying this result to the prediction of response variables in the Normal
linear regression model, we show that there exists a Bayesian predictive distribution
dominating that based on the uniform prior. Minimaxity of these Bayesian
predictions follows from these results.

Key words: Bayesian prediction, shrinkage estimation, Normal regression, superharmonic 
function, minimaxity, Kullback-Leibler divergence. 

1 Introduction 



Suppose that we have observations $y \sim N_d(y; \mu, \Sigma)$. Here $N_d$ is the density function of
the $d$-dimensional multivariate Normal distribution with mean vector $\mu$ and covariance
matrix $\Sigma$. We consider the prediction of $\tilde{y} \sim N_d(\tilde{y}; \mu, \tilde{\Sigma})$ using a predictive density
$\hat{p}(\tilde{y}|y)$. We assume that the mean of the distribution of unobserved (future) samples is
the same as that of the observed samples. However, the covariance matrices, $\Sigma$ and
$\tilde{\Sigma}$, are not necessarily the same or proportional to each other. We call a problem with
such settings the "problem with changeable covariances." As we will show below, the
changeable covariance is a natural assumption when we consider linear regression
problems.

In the present work, we assume that the mean vector $\mu$ is unknown and the covariance
matrix $\Sigma$ is known. We consider both cases where the future covariance $\tilde{\Sigma}$ is known and
unknown.

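We evaluate predictive densities $\hat{p}(\tilde{y}|y)$ by the KL loss function

$$D(p(\tilde{y}|\theta) \,\|\, \hat{p}(\tilde{y}|y)) := \int p(\tilde{y}|\theta) \log \frac{p(\tilde{y}|\theta)}{\hat{p}(\tilde{y}|y)} \, d\tilde{y} \tag{1}$$

and the (frequentist) risk function

$$R_{\mathrm{KL}}(\hat{p}, \theta) := \int p(y|\theta) \, D(p(\tilde{y}|\theta) \,\|\, \hat{p}(\tilde{y}|y)) \, dy. \tag{2}$$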



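We consider the Bayesian predictive density

$$p_\pi(\tilde{y}|y) := \int p(\tilde{y}|\theta) \, \pi(\theta|y) \, d\theta = \frac{\int p(\tilde{y}|\theta) \, p(y|\theta) \, \pi(\theta) \, d\theta}{\int p(y|\theta) \, \pi(\theta) \, d\theta}$$

with prior $\pi(\theta)$, where $\pi(\theta|y)$ denotes the posterior density. For the Normal model, the
Bayesian predictive density with the uniform prior $\pi_I(\mu) = 1$ becomes

$$p_I(\tilde{y}|y; \Sigma, \tilde{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\Sigma + \tilde{\Sigma}|^{1/2}} \exp\left( -\frac{(\tilde{y} - y)^T (\Sigma + \tilde{\Sigma})^{-1} (\tilde{y} - y)}{2} \right)$$

as we will see in Section 2. We write $p_\pi(\tilde{y}|y)$ for $p_\pi(\tilde{y}|y; \Sigma, \tilde{\Sigma})$ for short.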
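The closed form above says that, under the uniform prior, the predictive density is simply a Normal density centred at $y$ with covariance $\Sigma + \tilde{\Sigma}$. A minimal sketch of its evaluation follows; the function name `uniform_prior_predictive_logpdf` and its argument names are illustrative choices, not notation from the paper.

```python
import numpy as np


def uniform_prior_predictive_logpdf(y_new, y, Sigma, Sigma_new):
    """Log of N_d(y_new; y, Sigma + Sigma_new), the uniform-prior predictive density.

    Sigma is the covariance of the observed sample and Sigma_new the covariance
    of the future sample (both assumed known and positive definite here).
    """
    y_new = np.atleast_1d(np.asarray(y_new, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    S = np.atleast_2d(Sigma) + np.atleast_2d(Sigma_new)
    diff = y_new - y
    d = diff.size
    _, logdet = np.linalg.slogdet(S)
    quad = diff @ np.linalg.solve(S, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
```

Working on the log scale avoids underflow in higher dimensions; exponentiate the result when the density itself is needed.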

When $\tilde{\Sigma}$ is proportional to $\Sigma$, i.e. $\tilde{\Sigma} = a\Sigma$ for $a > 0$, the problem is reduced to the one
with $\Sigma = v I_d$ and $\tilde{\Sigma} = \tilde{v} I_d$ for positive scalar values $v$ and $\tilde{v}$. This case with "unchangeable
covariances" has been well studied. The Bayesian predictive density

$$p_I(\tilde{y}|y; \Sigma, \tilde{\Sigma}) = \frac{1}{\{2\pi(v + \tilde{v})\}^{d/2}} \exp\left( -\frac{\|\tilde{y} - y\|^2}{2(v + \tilde{v})} \right)$$

based on the uniform prior $\pi_I(\mu) = 1$ dominates the plug-in density

$$p(\tilde{y}|\hat{\mu}) = \frac{1}{\{2\pi\tilde{v}\}^{d/2}} \exp\left( -\frac{\|\tilde{y} - \hat{\mu}\|^2}{2\tilde{v}} \right)$$

with the MLE $\hat{\mu} = y$. Moreover, by Murray (1977) and Ng (1980), the Bayesian
predictive density $p_I(\tilde{y}|y)$ is the best predictive density that is invariant under the
translation group. In Liang & Barron (2004) and George et al. (2006), the minimaxity of $p_I$
was proved.
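The domination of the plug-in density by $p_I$ can be made concrete in the isotropic case, where both KL risks are constant in $\mu$. The closed forms used below are a short calculation from the Gaussian KL formula and $E\|y - \mu\|^2 = dv$; they are derived here for illustration and are not quoted from the paper. The function names are hypothetical.

```python
import math


def risk_plugin(d, v, v_new):
    """KL risk of the plug-in density N_d(y, v_new * I): d * v / (2 * v_new)."""
    return d * v / (2.0 * v_new)


def risk_uniform_prior(d, v, v_new):
    """KL risk of the uniform-prior predictive N_d(y, (v + v_new) * I):
    (d / 2) * log(1 + v / v_new)."""
    return 0.5 * d * math.log1p(v / v_new)
```

Since $\log(1 + x) < x$ for $x > 0$, the second risk is strictly smaller for every $d$, $v$ and $\tilde{v}$, which is the domination statement in this special case.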



In Komaki (2001), it was proved that the Bayesian predictive density $p_S(\tilde{y}|y)$ with the
Stein prior

$$\pi_S(\mu) := \|\mu\|^{-(d-2)} \tag{3}$$

dominates the Bayesian predictive density $p_I(\tilde{y}|y)$ with the uniform prior $\pi_I(\mu)$.

George et al. (2006) generalized the result of Komaki (2001). Define the marginal
distribution by

$$m_\pi(z; \Sigma) := \int N_d(z; \mu, \Sigma) \, \pi(\mu) \, d\mu. \tag{4}$$

As we will see in Theorem 2.4 below, George et al. (2006) proved a sufficient condition
on the prior $\pi(\mu)$ or the marginal distribution $m_\pi$ for $p_\pi(\tilde{y}|y)$ to dominate $p_I(\tilde{y}|y)$ when
$\tilde{\Sigma}$ is proportional to $\Sigma$. In the present work, we generalize the results of Komaki (2001)
and George et al. (2006) to the corresponding problem with the changeable covariances,
considering only finite sample cases. Asymptotic properties of Bayesian prediction are
studied in Komaki (1996), Corcuera & Giummolè (2000), and Komaki (2006).



2 Prior distributions independent of the future covariance

In this section, we develop and prove our main results concerning properties of $p_\pi(\tilde{y}|y)$ in
the problem with changeable covariances.

First we give three lemmas generalizing results proved in George et al. (2006) for the
problem with "unchangeable" variances.

Define the marginal distribution $m_\pi$ by (4).

Lemma 2.1 If $m_\pi(z; \Sigma) < \infty$ for all $z$, then $p_\pi(\tilde{y}|y)$ is a proper probability density.
Moreover, the mean of $p_\pi(\tilde{y}|y)$ is equal to the posterior mean $E_\pi[\mu|y]$ if it exists.

Let

$$w := (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} (\Sigma^{-1} y + \tilde{\Sigma}^{-1} \tilde{y})$$

and

$$\Sigma_w := (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1}. \tag{5}$$

The Bayesian predictive density based on a prior $\pi(\mu)$ can be expressed in terms of the
predictive density based on the uniform prior as follows:

Lemma 2.2

$$p_\pi(\tilde{y}|y) = p_I(\tilde{y}|y) \, \frac{m_\pi(w; \Sigma_w)}{m_\pi(y; \Sigma)}.$$

The following lemma is used for proving minimaxity of $p_\pi(\tilde{y}|y)$.

Lemma 2.3 The Bayesian predictive density $p_I(\tilde{y}|y)$ is minimax under the KL risk function
$R_{\mathrm{KL}}(\hat{p}, \mu)$.

Since the proofs of Lemma 2.1 and Lemma 2.3 are almost the same as those of Lemma 1
and Lemma 3 in George et al. (2006), we omit them. We prove only Lemma 2.2.

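Proof of Lemma 2.2.

$$
\begin{aligned}
p(\tilde{y}|\mu, \tilde{\Sigma}) \, p(y|\mu, \Sigma)
&= \frac{1}{(2\pi)^{d/2} |\tilde{\Sigma}|^{1/2}} \exp\left( -\frac{(\tilde{y} - \mu)^T \tilde{\Sigma}^{-1} (\tilde{y} - \mu)}{2} \right)
   \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{(y - \mu)^T \Sigma^{-1} (y - \mu)}{2} \right) \\
&= \frac{1}{(2\pi)^{d/2} |\tilde{\Sigma}|^{1/2}} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
   \exp\left( -\frac{(w - \mu)^T \Sigma_w^{-1} (w - \mu)}{2} \right)
   \exp\left( -\frac{\tilde{y}^T \tilde{\Sigma}^{-1} \tilde{y}}{2} - \frac{y^T \Sigma^{-1} y}{2} + \frac{w^T \Sigma_w^{-1} w}{2} \right) \\
&= \frac{1}{(2\pi)^{d/2} |\tilde{\Sigma}|^{1/2}} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
   \exp\left( -\frac{(w - \mu)^T \Sigma_w^{-1} (w - \mu)}{2} \right)
   \exp\left( -\frac{(\tilde{y} - y)^T (\Sigma + \tilde{\Sigma})^{-1} (\tilde{y} - y)}{2} \right).
\end{aligned}
\tag{6}
$$

In the last equation, we use

$$
\begin{aligned}
\Sigma^{-1} (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} \Sigma^{-1} - \Sigma^{-1}
&= \Sigma^{-1} (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} \Sigma^{-1} - \Sigma^{-1} (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} (\Sigma^{-1} + \tilde{\Sigma}^{-1}) \\
&= -\Sigma^{-1} (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} \tilde{\Sigma}^{-1} \\
&= -(\Sigma + \tilde{\Sigma})^{-1}.
\end{aligned}
$$

From (6), the predictive density with the uniform prior $\pi_I(\mu) = 1$ is given by

$$
\begin{aligned}
p_I(\tilde{y}|y)
&= \frac{\int p(\tilde{y}|\mu, \tilde{\Sigma}) \, p(y|\mu, \Sigma) \, d\mu}{\int p(y|\mu, \Sigma) \, d\mu} \\
&= \frac{|\Sigma^{-1} + \tilde{\Sigma}^{-1}|^{-1/2} (2\pi)^{d/2}}{(2\pi)^{d/2} |\tilde{\Sigma}|^{1/2} (2\pi)^{d/2} |\Sigma|^{1/2}}
   \exp\left( -\frac{(\tilde{y} - y)^T (\Sigma + \tilde{\Sigma})^{-1} (\tilde{y} - y)}{2} \right) \\
&= (2\pi)^{-d/2} |\Sigma + \tilde{\Sigma}|^{-1/2} \exp\left( -\frac{(\tilde{y} - y)^T (\Sigma + \tilde{\Sigma})^{-1} (\tilde{y} - y)}{2} \right).
\end{aligned}
$$

Therefore

$$
p_\pi(\tilde{y}|y)
= \frac{\int p(\tilde{y}|\mu, \tilde{\Sigma}) \, p(y|\mu, \Sigma) \, \pi(\mu) \, d\mu}{\int p(y|\mu, \Sigma) \, \pi(\mu) \, d\mu}
= p_I(\tilde{y}|y) \, \frac{\int N(w; \mu, \Sigma_w) \, \pi(\mu) \, d\mu}{\int N(y; \mu, \Sigma) \, \pi(\mu) \, d\mu}.
$$

□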
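Lemma 2.2 can be checked numerically whenever both sides are available in closed form. The sketch below uses a proper Gaussian prior $\pi = N(0, T)$, which is an assumption made only for this check (the paper's interest is in improper shrinkage priors); with it, $m_\pi(z; \Sigma) = N(z; 0, \Sigma + T)$ and the predictive density follows from the conjugate posterior. The helper names are hypothetical.

```python
import numpy as np


def mvn_pdf(x, mean, cov):
    """Density of N(mean, cov) at x."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d = diff.size
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))


def lemma_2_2_sides(y, y_new, Sigma, Sigma_new, T):
    """Return (direct, via_lemma) for the illustrative prior pi = N(0, T)."""
    zero = np.zeros(len(y))
    # Direct route: conjugate Gaussian posterior, then the predictive density.
    post_cov = np.linalg.inv(np.linalg.inv(Sigma) + np.linalg.inv(T))
    post_mean = post_cov @ np.linalg.solve(Sigma, y)
    direct = mvn_pdf(y_new, post_mean, Sigma_new + post_cov)
    # Lemma 2.2 route: p_I(y_new | y) * m_pi(w; Sigma_w) / m_pi(y; Sigma).
    Sigma_w = np.linalg.inv(np.linalg.inv(Sigma) + np.linalg.inv(Sigma_new))
    w = Sigma_w @ (np.linalg.solve(Sigma, y) + np.linalg.solve(Sigma_new, y_new))
    p_uniform = mvn_pdf(y_new, y, Sigma + Sigma_new)
    m_w = mvn_pdf(w, zero, Sigma_w + T)
    m_y = mvn_pdf(y, zero, Sigma + T)
    return direct, p_uniform * m_w / m_y
```

The two returned values should agree up to floating-point error for any positive definite `Sigma`, `Sigma_new` and `T`.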



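Next, the difference of the risk functions of the two priors is evaluated. Let

$$R_{\mathrm{KL}}(\pi, \mu) := \int p(y|\mu, \Sigma) \, D(p(\tilde{y}|\mu, \tilde{\Sigma}) \,\|\, p_\pi(\tilde{y}|y)) \, dy$$

and

$$\phi_\pi(\mu, \Sigma) := \int N(z; \mu, \Sigma) \log m_\pi(z; \Sigma) \, dz.$$

Then from Lemma 2.2, and because $w \sim N(\mu, \Sigma_w)$ under the joint distribution of $(y, \tilde{y})$,

$$
\begin{aligned}
R_{\mathrm{KL}}(\pi, \mu) - R_{\mathrm{KL}}(\pi_I, \mu)
&= \iint p(y|\mu, \Sigma) \, p(\tilde{y}|\mu, \tilde{\Sigma}) \log \frac{p_I(\tilde{y}|y)}{p_\pi(\tilde{y}|y)} \, d\tilde{y} \, dy \\
&= \iint p(y|\mu, \Sigma) \, p(\tilde{y}|\mu, \tilde{\Sigma}) \log \frac{m_\pi(y; \Sigma)}{m_\pi(w; \Sigma_w)} \, d\tilde{y} \, dy \\
&= \phi_\pi(\mu, \Sigma) - \phi_\pi(\mu, \Sigma_w).
\end{aligned}
\tag{7}
$$

Now $\Sigma_w = (\Sigma^{-1} + \tilde{\Sigma}^{-1})^{-1} \prec \Sigma$. In order to prove $R_{\mathrm{KL}}(\pi, \mu) \leq R_{\mathrm{KL}}(\pi_I, \mu)$, it suffices to
prove $\phi_\pi(\mu, \Sigma) \leq \phi_\pi(\mu, \Sigma_w)$.

Before stating the main results for the problem with changeable covariances, we review
some results with a special setting, i.e., unchangeable covariances.

An extended real-valued function $\pi(\mu)$ on an open set $R \subset \mathbb{R}^d$ is said to be
superharmonic when it satisfies the following properties:

1. $-\infty < \pi(\mu) \leq \infty$ and $\pi(\mu) \not\equiv \infty$ on any component of $R$.

2. $\pi(\mu)$ is lower semi-continuous on $R$.

3. If $G$ is an open subset of $R$ with compact closure $\bar{G} \subset R$, $u(\mu)$ is a continuous
function on $\bar{G}$, $u(\mu)$ is harmonic on $G$, and $\pi(\mu) \geq u(\mu)$ on $\partial G$, then $\pi(\mu) \geq u(\mu)$
on $G$.

If $\pi(\mu)$ is a $C^2$ function, then $\pi(\mu)$ is superharmonic on $R$ if and only if $\Delta\pi \leq 0$ on $R$.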
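The $C^2$ criterion $\Delta\pi \leq 0$ is easy to probe numerically away from singularities. The sketch below approximates the Laplacian by central differences and applies it to the Stein prior (3), which is harmonic away from the origin, so the approximation should be close to zero there. The step size `h` and the function names are illustrative assumptions, and a finite-difference check at a few points is of course not a proof of superharmonicity.

```python
import numpy as np


def laplacian_fd(f, x, h=1e-3):
    """Central-difference approximation of the Laplacian of f at the point x."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        total += (f(x + e) - 2.0 * f(x) + f(x - e)) / h**2
    return total


def stein_prior(mu):
    """Stein prior ||mu||^{-(d-2)}, with d taken from the length of mu."""
    mu = np.asarray(mu, dtype=float)
    return np.linalg.norm(mu) ** (-(mu.size - 2))
```

For a radial power $r^c$ in dimension $d$ the exact value is $\Delta r^c = c(c + d - 2) r^{c-2}$, so any exponent $c \in [-(d-2), 0]$ gives a nonpositive Laplacian away from the origin.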



Theorem 2.4 (Komaki (2001) and George et al. (2006))

Assume $d \geq 3$.

(i) If $\pi(\mu)$ is the Stein prior $\pi_S(\mu)$,

$$v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu, v_1 I_d) < \phi_\pi(\mu, v_2 I_d) \quad \text{for all } \mu.$$

(ii) If $\pi(\mu)$ is a superharmonic function and $m_\pi(z; v I_d) < \infty$ for any $z$ and $v$,

$$v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu, v_1 I_d) \leq \phi_\pi(\mu, v_2 I_d) \quad \text{for all } \mu.$$

Furthermore, if $m_\pi(z; v I_d)$ is also not constant for all $v_2 \leq v \leq v_1$, the inequality
holds strictly.

(iii) If $\sqrt{m_\pi(z; v I_d)}$ is a superharmonic function for any $v$ and $m_\pi(z; v I_d) < \infty$ for
any $z$ and $v$,

$$v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu, v_1 I_d) \leq \phi_\pi(\mu, v_2 I_d) \quad \text{for all } \mu.$$

Furthermore, if $m_\pi(z; v I_d)$ is also not constant for any $v_2 \leq v \leq v_1$, the inequality
holds strictly.

We note that (iii) implies (ii) and (ii) implies (i). (i) was proved in Komaki (2001).
(ii) and (iii) were proved in George et al. (2006).



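Theorem 2.5 is a generalization of (ii) of Theorem 2.4 to the problem with changeable
covariances. For each prior $\pi(\mu)$, define a rescaled prior with respect to a positive definite
$d \times d$ matrix $\Sigma^*$ by

$$\pi_{\Sigma^*}(\mu) := \pi(\Sigma^{*-1/2} \mu).$$

In particular, we call $\pi_{S;\Sigma^*}(\mu) := \pi_S(\Sigma^{*-1/2} \mu)$ a rescaled Stein prior with respect to $\Sigma^*$.
We consider the Bayesian risk with priors $p(\Sigma)$ and $p(\tilde{\Sigma})$:

$$\mathcal{R}_{\mathrm{KL}}(\pi, \mu) = \iint p(\Sigma) \, p(\tilde{\Sigma}) \, R_{\mathrm{KL}}(\pi, \mu) \, d\Sigma \, d\tilde{\Sigma},$$

where $d\Sigma$ means the Lebesgue measure on the vector space of all components of a matrix $\Sigma$.
Define

$$
\bar{\phi}_{\pi_\Sigma}(\mu) := \iint p(\Sigma) \, p(\tilde{\Sigma}) \, \phi_{\pi_\Sigma}(\mu, \Sigma) \, d\Sigma \, d\tilde{\Sigma}
= \iiint p(\Sigma) \, p(\tilde{\Sigma}) \, N(z; \mu, \Sigma) \log m_{\pi_\Sigma}(z; \Sigma) \, dz \, d\Sigma \, d\tilde{\Sigma} \tag{8}
$$

and

$$
\bar{\phi}^{w}_{\pi_\Sigma}(\mu) := \iint p(\Sigma) \, p(\tilde{\Sigma}) \, \phi_{\pi_\Sigma}(\mu, \Sigma_w) \, d\Sigma \, d\tilde{\Sigma}
= \iiint p(\Sigma) \, p(\tilde{\Sigma}) \, N(z; \mu, \Sigma_w) \log m_{\pi_\Sigma}(z; \Sigma_w) \, dz \, d\Sigma \, d\tilde{\Sigma}. \tag{9}
$$

Then from (7),

$$\mathcal{R}_{\mathrm{KL}}(\pi_\Sigma, \mu) - \mathcal{R}_{\mathrm{KL}}(\pi_I, \mu) = \bar{\phi}_{\pi_\Sigma}(\mu) - \bar{\phi}^{w}_{\pi_\Sigma}(\mu). \tag{10}$$

We consider the case where $p(\Sigma)$, $p(\tilde{\Sigma})$, and $\pi(\mu)$ are rotation invariant. Here, a
function $f(\Sigma)$ of a matrix $\Sigma \in \mathbb{R}^{d \times d}$ and a function $g(\mu)$ of a vector $\mu \in \mathbb{R}^d$ are said
to be rotation invariant if $f(\Sigma) = f(P \Sigma P^T)$ and $g(\mu) = g(P\mu)$, respectively, for every
orthogonal matrix $P \in \mathbb{R}^{d \times d}$.

Theorem 2.5

Let $d \geq 3$. If $p(\Sigma)$ and $p(\tilde{\Sigma})$ are rotation invariant functions and $\pi$ is a rotation invariant
superharmonic prior, then

$$\mathcal{R}_{\mathrm{KL}}(\pi_\Sigma, \mu) \leq \mathcal{R}_{\mathrm{KL}}(\pi_I, \mu)$$

for any $\mu$. In particular, the Bayesian predictive distribution $p_{\pi_\Sigma}(\tilde{y}|y)$ with $\pi_\Sigma$ dominates
that based on $\pi_I$ if $\pi$ is also not constant.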
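Proof. We note that $m_\pi(z; \Sigma) < \infty$ for every $z \in \mathbb{R}^d$ and positive definite matrix
$\Sigma \in \mathbb{R}^{d \times d}$ from Lemma A.1 in the appendix.

First, we prove invariance of $\bar{\phi}_{\pi_\Sigma}(\mu)$ and $\bar{\phi}^{w}_{\pi_\Sigma}(\mu)$ under rotations of $\mu$.

Let $P$ be a $d \times d$ orthogonal matrix, then

$$
\begin{aligned}
\bar{\phi}_{\pi_\Sigma}(P\mu)
&= \int p(\Sigma) \, p(\tilde{\Sigma}) \, N(z; P\mu, \Sigma) \log \int N(z; \mu', \Sigma) \, \pi(\Sigma^{-1/2} \mu') \, d\mu' \, dz \, d\Sigma \, d\tilde{\Sigma} \\
&= \int p(\Sigma) \, p(\tilde{\Sigma}) \, N(\bar{z}; \mu, P^T \Sigma P) \log \int N(\bar{z}; \bar{\mu}', P^T \Sigma P) \, \pi(\Sigma^{-1/2} P \bar{\mu}') \, d\bar{\mu}' \, d\bar{z} \, d\Sigma \, d\tilde{\Sigma} \\
&= \int p(P \bar{\Sigma} P^T) \, p(\tilde{\Sigma}) \, N(\bar{z}; \mu, \bar{\Sigma}) \log \int N(\bar{z}; \bar{\mu}', \bar{\Sigma}) \, \pi(\bar{\Sigma}^{-1/2} \bar{\mu}') \, d\bar{\mu}' \, d\bar{z} \, d\bar{\Sigma} \, d\tilde{\Sigma} \\
&= \bar{\phi}_{\pi_\Sigma}(\mu),
\end{aligned}
$$

where $\bar{z} := P^T z$, $\bar{\mu}' := P^T \mu'$ and $\bar{\Sigma} := P^T \Sigma P$, and the last two steps use the rotation
invariance of $\pi$ and $p(\Sigma)$.

The proof of the rotation invariance of $\bar{\phi}^{w}_{\pi_\Sigma}(\mu)$ is nearly the same.

We define

$$\mu^* := \arg\max_{\mu' : \|\mu'\| = \|\mu\|} \frac{\|\Sigma^{-1/2} \mu'\|}{\|\Sigma_w^{-1/2} \mu'\|}$$

and

$$\tau := \frac{\|\Sigma^{-1/2} \mu^*\|}{\|\Sigma_w^{-1/2} \mu^*\|}. \tag{11}$$

Note that $0 < \tau < 1$, because $\tilde{\Sigma}$ is positive definite. Moreover,

$$\|\Sigma_w^{-1/2} \tau \mu'\| = \tau \|\Sigma_w^{-1/2} \mu'\| \geq \frac{\|\Sigma^{-1/2} \mu'\|}{\|\Sigma_w^{-1/2} \mu'\|} \|\Sigma_w^{-1/2} \mu'\| = \|\Sigma^{-1/2} \mu'\|$$

for every $\mu'$.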
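Because the ratio in the definition of $\mu^*$ does not depend on the length of $\mu'$, $\tau^2$ is the largest generalized eigenvalue of the pair $(\Sigma^{-1}, \Sigma_w^{-1})$. The sketch below computes it through a Cholesky factor of $\Sigma_w$; this route and the name `shrink_factor_tau` are illustrative assumptions, not the paper's construction.

```python
import numpy as np


def shrink_factor_tau(Sigma, Sigma_new):
    """Return (tau, Sigma_w) with tau = max over mu' of
    ||Sigma^{-1/2} mu'|| / ||Sigma_w^{-1/2} mu'||."""
    Sigma_w = np.linalg.inv(np.linalg.inv(Sigma) + np.linalg.inv(Sigma_new))
    C = np.linalg.cholesky(Sigma_w)            # Sigma_w = C C^T
    # With mu' = C u the squared ratio becomes u^T (C^T Sigma^{-1} C) u / u^T u,
    # so tau^2 is the largest eigenvalue of the symmetric matrix below.
    M = C.T @ np.linalg.solve(Sigma, C)
    M = 0.5 * (M + M.T)                        # symmetrize against round-off
    lam_max = np.linalg.eigvalsh(M)[-1]        # eigvalsh sorts ascending
    return np.sqrt(lam_max), Sigma_w
```

The returned `tau` can then be used to check the displayed inequality $\tau \|\Sigma_w^{-1/2} \mu'\| \geq \|\Sigma^{-1/2} \mu'\|$ on arbitrary vectors.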



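From the rotation invariance of $\bar{\phi}_{\pi_\Sigma}$,

$$
\begin{aligned}
\bar{\phi}_{\pi_\Sigma}(\mu)
&= E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \mu^*, \Sigma) \log \int N(z; \mu', \Sigma) \, \pi(\Sigma^{-1/2} \mu') \, d\mu' \, dz \right] \\
&= E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \Sigma^{-1/2} \mu^*, I_d) \log \int N(z; \mu', I_d) \, \pi(\mu') \, d\mu' \, dz \right] \\
&= E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \tau \Sigma_w^{-1/2} \mu^*, I_d) \log \int N(z; \mu', I_d) \, \pi(\mu') \, d\mu' \, dz \right] \\
&= E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \Sigma_w^{-1/2} \mu^*, \tau^{-2} I_d) \log \int N(z; \mu', \tau^{-2} I_d) \, \pi(\tau \mu') \, d\mu' \, dz \right] \\
&\leq E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \Sigma_w^{-1/2} \mu^*, I_d) \log \int N(z; \mu', I_d) \, \pi(\tau \mu') \, d\mu' \, dz \right] \\
&= E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \mu^*, \Sigma_w) \log \int N(z; \mu', \Sigma_w) \, \pi(\tau \Sigma_w^{-1/2} \mu') \, d\mu' \, dz \right].
\end{aligned}
\tag{12}
$$

Here, inequality (12) is given by Theorem 2.4 (ii).

Since every rotation invariant superharmonic function is radially nonincreasing,

$$\pi(\tau \Sigma_w^{-1/2} \mu') \leq \pi(\Sigma^{-1/2} \mu').$$

From this inequality,

$$
\begin{aligned}
& E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \mu^*, \Sigma_w) \log \int N(z; \mu', \Sigma_w) \, \pi(\tau \Sigma_w^{-1/2} \mu') \, d\mu' \, dz \right] \\
&\quad \leq E_{\Sigma, \tilde{\Sigma}} \left[ \int N(z; \mu^*, \Sigma_w) \log \int N(z; \mu', \Sigma_w) \, \pi(\Sigma^{-1/2} \mu') \, d\mu' \, dz \right] \\
&\quad = \bar{\phi}^{w}_{\pi_\Sigma}(\mu).
\end{aligned}
\tag{13}
$$

In particular, if $\pi$ is not constant, inequality (12) holds strictly. Therefore, $p_{\pi_\Sigma}$ dominates
$p_I$. □

Since $p_I$ is minimax by Lemma 2.3, $p_{\pi_\Sigma}$ is also proved to be minimax.

Corollary 2.6

Assume $d \geq 3$. Let $p(\Sigma)$ and $p(\tilde{\Sigma})$ be rotation invariant continuous functions. If $\pi$ is a
rotation invariant superharmonic prior, the Bayesian predictive density $p_{\pi_\Sigma}(\tilde{y}|y)$ is minimax
under $\mathcal{R}_{\mathrm{KL}}$.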

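Theorem 2.5 and Corollary 2.6 can be generalized to the case with a positive semidefinite
future covariance matrix $\tilde{\Sigma}$. Let $\tilde{\Sigma}$ be a $d$-dimensional positive semidefinite matrix whose
rank is $k > 0$. Then there is a $d \times k$ matrix $L$ satisfying $\tilde{\Sigma} = L L^T$. Let $\{a_i\}_{i=1}^{d-k}$ be a
set of orthonormal vectors that are orthogonal to each column vector of $L$, i.e.
$L^T a_i = 0$ and $a_i^T a_j = \delta_{ij}$ for $i, j = 1, \ldots, d - k$. Define the Normal distribution with
positive semidefinite covariance matrix by

$$N_d(\tilde{y}; \mu, \tilde{\Sigma}) := \frac{1}{(2\pi)^{k/2} |L^T L|^{1/2}} \exp\left( -\frac{(\tilde{y} - \mu)^T \tilde{\Sigma}^{\dagger} (\tilde{y} - \mu)}{2} \right) \prod_{i=1}^{d-k} \delta\left( a_i^T (\tilde{y} - \mu) \right),$$

where $\tilde{\Sigma}^{\dagger}$ is the Moore-Penrose pseudo-inverse of $\tilde{\Sigma}$.

From the results of functional analysis, $N_d(\tilde{y}; \mu, \tilde{\Sigma})$ for any positive semidefinite $\tilde{\Sigma}$ is
equivalent to $\lim_{\epsilon \to 0} N_d(\tilde{y}; \mu, \tilde{\Sigma} + \epsilon I_d)$ as a functional on Schwartz functions of $\tilde{y}$.

Using this equivalence and the bounded convergence theorem, equation (7) is valid for
a semidefinite future covariance matrix if we define $\Sigma_w := (\Sigma^{-1} + \tilde{\Sigma}^{\dagger})^{-1}$. Because $\tilde{\Sigma} \neq 0$,
$\tau$ defined by (11) takes value in $(0, 1)$. Therefore, Theorem 2.5 and Corollary 2.6 hold for
each semidefinite future covariance matrix $\tilde{\Sigma}$.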



3 Prior distributions depending on the future covariance

In this section, we consider prior distributions depending on the future covariance matrix.
Theorem 3.2 below says that every Bayesian prediction with an adequately metrized
prior dominates that based on the uniform prior. Although the assumption that priors
can depend on the future covariance may seem strange, this assumption is natural when
we consider the linear regression problem, as we will see in Section 4.

First, we generalize Theorem 2.4 to the case with non-identity covariances. Let $\mu$ and
$z$ be vectors in $\mathbb{R}^d$ and let $\Sigma \in \mathbb{R}^{d \times d}$ be a positive definite matrix.

Let $\Sigma_1$ and $\Sigma_2$ be positive definite matrices such that $\Sigma_1 \preceq \Sigma_2$. An orthogonal
matrix $U$ and a diagonal matrix $\Lambda$ are given by a diagonalization of $\Sigma_1^{1/2} \Sigma_2^{-1} \Sigma_1^{1/2}$, i.e.
$\Sigma_1^{1/2} \Sigma_2^{-1} \Sigma_1^{1/2} = U^T \Lambda U$. Let $A^* := \Sigma_1^{1/2} U^T (\Lambda^{-1} - I_d)^{1/2}$.

Proposition 3.1 If $\pi$ is a prior s.t. $\pi(A^* \mu)$ is a superharmonic function of $\mu$, then

$$\phi_\pi(\mu, \Sigma_1) \geq \phi_\pi(\mu, \Sigma_2) \tag{14}$$

for any $\mu \in \mathbb{R}^d$. Inequality (14) becomes strict if $\pi$ is not a constant function.

The following theorem is a direct result of Proposition 3.1 applied with $\Sigma_1 = \Sigma_w$ and
$\Sigma_2 = \Sigma$, together with (7).

Theorem 3.2 If $\pi(A^* \mu)$ is a superharmonic function of $\mu$, then $R_{\mathrm{KL}}(\pi, \mu) \leq R_{\mathrm{KL}}(\pi_I, \mu)$.
Furthermore, if $\pi$ is not a constant function, the Bayesian predictive distribution $p_\pi$
dominates the one with the uniform prior $\pi_I$.

Note that $\pi(A^* \mu)$ can be superharmonic only if $\mathrm{rank}(\Sigma_2 - \Sigma_1) \geq 3$.

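Proof of Proposition 3.1 and Theorem 3.2. Assume $0 \prec \Sigma_1 \preceq \Sigma_2$ and let
$\Sigma_1^{1/2} \Sigma_2^{-1} \Sigma_1^{1/2} = U^T \Lambda U$ be a diagonalization. Then,

$$
\phi_\pi(\mu, \Sigma_2) = \int \log \left\{ \int \pi(\nu) \frac{1}{(2\pi)^{d/2} |\Sigma_2|^{1/2}} \exp\left( -\frac{(x - \nu)^T \Sigma_2^{-1} (x - \nu)}{2} \right) d\nu \right\}
\frac{1}{(2\pi)^{d/2} |\Sigma_2|^{1/2}} \exp\left( -\frac{(x - \mu)^T \Sigma_2^{-1} (x - \mu)}{2} \right) dx.
$$

Let $\bar{x} := U \Sigma_1^{-1/2} x$, $\bar{\mu} := U \Sigma_1^{-1/2} \mu$ and $\bar{\nu} := U \Sigma_1^{-1/2} \nu$. By $|\Sigma_2|^{-1/2} |\Sigma_1|^{1/2} = |\Lambda|^{1/2}$,

$$
\begin{aligned}
\phi_\pi(\mu, \Sigma_2)
&= \int \log \left\{ \int \pi(\Sigma_1^{1/2} U^T \bar{\nu}) \frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}} \exp\left( -\frac{(\bar{x} - \bar{\nu})^T \Lambda (\bar{x} - \bar{\nu})}{2} \right) d\bar{\nu} \right\}
\frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}} \exp\left( -\frac{(\bar{x} - \bar{\mu})^T \Lambda (\bar{x} - \bar{\mu})}{2} \right) d\bar{x} \\
&= \phi_{\pi(\Sigma_1^{1/2} U^T \cdot)}(\bar{\mu}; \Lambda^{-1}),
\end{aligned}
\tag{15}
$$

where $\pi(\Sigma_1^{1/2} U^T \cdot)$ is a prior distribution whose density function is represented by $\pi(\Sigma_1^{1/2} U^T \bar{\mu})$
with a prior density $\pi(\mu)$.
Putting $\Sigma_2 = \Sigma_1$, we get

$$\phi_\pi(\mu, \Sigma_1) = \phi_{\pi(\Sigma_1^{1/2} U^T \cdot)}(\bar{\mu}; I_d), \tag{16}$$

where $I_d$ is the $d$-dimensional identity matrix.

We denote each diagonal component of $\Lambda$ by $\lambda_i$. Now $0 < \lambda_i \leq 1$ for each $i$ since
$\Sigma_1 \preceq \Sigma_2$. Let $a_i(t) := 1 + t(\lambda_i^{-1} - 1)$ and $A := \mathrm{diag}(a_i)$. Then

$$
\begin{aligned}
\phi_\pi(\mu, \Sigma_2) - \phi_\pi(\mu, \Sigma_1)
&= \int_{t=0}^{1} \frac{\partial}{\partial t} \phi_{\pi(\Sigma_1^{1/2} U^T \cdot)}(\bar{\mu}, A) \, dt \\
&= \int_{t=0}^{1} \sum_{i=1}^{d} \frac{\partial a_i(t)}{\partial t} \frac{\partial}{\partial a_i} \phi_{\pi(\Sigma_1^{1/2} U^T \cdot)}(\bar{\mu}, A) \, dt \\
&= \int_{t=0}^{1} \sum_{i=1}^{d} \frac{\partial}{\partial \check{a}_i} \phi_{\pi(A^* \cdot)}(\check{\mu}, \check{A}) \, dt,
\end{aligned}
$$

where $\check{a}_i := (\lambda_i^{-1} - 1)^{-1} a_i$, $\check{A} := \mathrm{diag}(\check{a}_i)$ and $\check{\mu} := (\Lambda^{-1} - I_d)^{-1/2} \bar{\mu}$.

By assumption, $\pi(A^* \cdot)$ for $A^* = \Sigma_1^{1/2} U^T (\Lambda^{-1} - I_d)^{1/2}$ is superharmonic. Now it is
sufficient to prove Lemma 3.3 iii) below. □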



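Lemma 3.3 i) $\sum_{i=1}^{d} \frac{\partial}{\partial a_i} N(x; \mu, A) = \frac{1}{2} \Delta N(x; \mu, A)$.

ii) $\int f(x - t) \, d\mu(t)$ is a superharmonic function of $x$ if $f$ is a superharmonic function and
$\mu$ is a positive measure on $\mathbb{R}^d$.

iii) $\sum_{i=1}^{d} \frac{\partial}{\partial a_i} \phi_\pi(u, A) \leq 0$ for any $u \in \mathbb{R}^d$, $a_i > 0$, and $A = \mathrm{diag}(a_i)$ for each
superharmonic prior $\pi$.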
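Part i) is the heat-equation identity for a Normal density with diagonal covariance: each variance derivative equals half of the matching second derivative in $x$. A small finite-difference sketch of both sides follows; the step size `h` and the function names are illustrative assumptions.

```python
import numpy as np


def diag_normal_pdf(x, mu, a):
    """Density of N(mu, diag(a)) at x."""
    x, mu, a = (np.asarray(v, dtype=float) for v in (x, mu, a))
    return np.prod(np.exp(-(x - mu) ** 2 / (2.0 * a)) / np.sqrt(2.0 * np.pi * a))


def lemma_3_3_i_sides(x, mu, a, h=1e-4):
    """Finite-difference values of sum_i dN/da_i and (1/2) * Laplacian_x N."""
    x, mu, a = (np.asarray(v, dtype=float) for v in (x, mu, a))
    lhs = 0.0
    rhs = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        # central difference in the i-th variance a_i
        lhs += (diag_normal_pdf(x, mu, a + e) - diag_normal_pdf(x, mu, a - e)) / (2.0 * h)
        # central second difference in the i-th coordinate x_i
        second = (diag_normal_pdf(x + e, mu, a) - 2.0 * diag_normal_pdf(x, mu, a)
                  + diag_normal_pdf(x - e, mu, a)) / h**2
        rhs += 0.5 * second
    return lhs, rhs
```

The two returned numbers should agree up to discretization error for any point `x`, mean `mu` and positive variances `a`.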



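Proof of Lemma 3.3. Lemma i) follows from direct calculation. For a proof of ii), see
Problem 1.7.16 of Lehmann & Casella (1998).

$$
\begin{aligned}
\sum_{i=1}^{d} \frac{\partial}{\partial a_i} \phi_\pi(u, A)
&= \sum_{i=1}^{d} \frac{\partial}{\partial a_i} \int \log \left\{ \int \pi(\nu) N(x; \nu, A) \, d\nu \right\} N(x; u, A) \, dx \\
&= \int \frac{\sum_{i=1}^{d} \frac{\partial}{\partial a_i} \int \pi(\nu) N(x; \nu, A) \, d\nu}{\int \pi(\nu) N(x; \nu, A) \, d\nu} \, N(x; u, A) \, dx \\
&\quad + \int \log \left\{ \int \pi(\nu) N(x; \nu, A) \, d\nu \right\} \sum_{i=1}^{d} \frac{\partial}{\partial a_i} N(x; u, A) \, dx.
\end{aligned}
\tag{17}
$$

Now,

$$\sum_{i=1}^{d} \frac{\partial}{\partial a_i} \int \pi(\nu) N(x; \nu, A) \, d\nu = \frac{1}{2} \Delta \int \pi(\nu) N(x; \nu, A) \, d\nu \leq 0$$

from Lemma 3.3 i) and ii). Thus, the first term of the right-hand side of (17) is
non-positive. The second term of the right-hand side of (17) becomes

$$
\frac{1}{2} \int \log \left\{ \int \pi(\nu) N(x; \nu, A) \, d\nu \right\} \Delta N(x; u, A) \, dx
= \frac{1}{2} \int \Delta \log \left\{ \int \pi(\nu) N(x; \nu, A) \, d\nu \right\} N(x; u, A) \, dx
\tag{18}
$$

by i) and the self-adjoint property of the Laplacian. Since the logarithm of a superharmonic
function is superharmonic (see Problem 1.7.16 of Lehmann & Casella (1998)), (18)
is non-positive from ii). Thus Lemma 3.3 iii) is proved. □

Example 3.4 A rescaled Stein prior

$$\pi_{S; \Sigma_2 - \Sigma_1}(\mu) = \|(\Sigma_2 - \Sigma_1)^{-1/2} \mu\|^{-(d-2)}$$

satisfies the condition of Proposition 3.1 and Theorem 3.2. This is because

$$
\begin{aligned}
\|(\Sigma_2 - \Sigma_1)^{-1/2} \mu\|^{-(d-2)}
&= \left( \mu^T \Sigma_1^{-1/2} (\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2} - I_d)^{-1} \Sigma_1^{-1/2} \mu \right)^{-(d-2)/2} \\
&= \left( \mu^T \Sigma_1^{-1/2} U^T (\Lambda^{-1} - I_d)^{-1} U \Sigma_1^{-1/2} \mu \right)^{-(d-2)/2}.
\end{aligned}
$$

Thus, $\pi_{S; \Sigma_2 - \Sigma_1}(A^* \mu) = \pi_S(\mu)$.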
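The matrix $A^*$ can be built directly from an eigendecomposition, and the construction implies $A^* A^{*T} = \Sigma_2 - \Sigma_1$, which is what makes the rescaled Stein prior of Example 3.4 reduce to $\pi_S$. The sketch below assumes $\Sigma_1 \prec \Sigma_2$ strictly so that every $\lambda_i < 1$; the helper names are hypothetical.

```python
import numpy as np


def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T


def a_star(Sigma1, Sigma2):
    """A* = Sigma1^{1/2} U^T (Lambda^{-1} - I)^{1/2}, where
    Sigma1^{1/2} Sigma2^{-1} Sigma1^{1/2} = U^T Lambda U."""
    R = sym_sqrt(Sigma1)
    M = R @ np.linalg.solve(Sigma2, R)
    M = 0.5 * (M + M.T)                      # symmetrize against round-off
    lam, V = np.linalg.eigh(M)               # M = V diag(lam) V^T, so U = V^T
    scale = np.sqrt(np.clip(1.0 / lam - 1.0, 0.0, None))
    return R @ V @ np.diag(scale)
```

With `A = a_star(Sigma1, Sigma2)`, the quadratic form `(A @ mu) @ inv(Sigma2 - Sigma1) @ (A @ mu)` should equal `mu @ mu`, matching the last line of Example 3.4.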

4 Application to the Normal linear regression problem

In this section, we apply the results in the previous section to the Normal linear regression 
problem. 

Consider a Normal linear model

$$y = X^T \beta + \varepsilon, \tag{19}$$

$$\varepsilon \sim N_p(0, \sigma^2 I_p),$$

where the target variable $y$ is a $p$-dimensional vector, $X$ is a $d \times p$ matrix composed of the
explanatory variables, $\sigma^2 > 0$ is an unknown variance, and $\beta$ is an unknown $d$-dimensional
vector. When the rightmost column of $X$ is the constant vector $(1, \ldots, 1)$, the model
(19) is a model with an intercept, $y = X^T \beta + \beta_0 + \varepsilon$.
We suppose that a future sample $\tilde{y}$ is generated by

$$\tilde{y} = \tilde{X}^T \beta + \tilde{\varepsilon}, \tag{20}$$

$$\tilde{\varepsilon} \sim N_{\tilde{p}}(0, \tilde{\sigma}^2 I_{\tilde{p}}),$$

where $\tilde{y}$ is a $\tilde{p}$-dimensional vector, $\tilde{X}$ is a $d \times \tilde{p}$ matrix, and $\tilde{\sigma}^2 > 0$ is an unknown variance.

In the present work, we assume that $p \geq d$ and $X X^T$ is regular; however, neither $\tilde{p} \geq d$
nor regularity of $\tilde{X} \tilde{X}^T$ is necessary.

We consider the prediction problem for the linear regression models (19) and (20) with
the KL risk function

$$R_{\mathrm{KL}}(\beta, \hat{p}, X, \tilde{X}) := \int p(y|X; \beta, \sigma^2) \, D(p(\tilde{y}|\tilde{X}; \beta, \tilde{\sigma}^2) \,\|\, \hat{p}(\tilde{y}|X, y, \tilde{X}; \sigma^2, \tilde{\sigma}^2)) \, dy$$

and the partial Bayesian risk function with priors $p(X)$ and $p(\tilde{X})$:

$$\mathcal{R}_{\mathrm{KL}}(\beta, \hat{p}) := \iint p(X) \, p(\tilde{X}) \, R_{\mathrm{KL}}(\beta, \hat{p}, X, \tilde{X}) \, dX \, d\tilde{X}.$$

Note that we do not assume any prior for $\beta$.

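Next, the regression model is reduced to a Normal model discussed in Section 2. Let
$y_1 := (X X^T)^{-1} X y$ and $y_2 := y - X^T (X X^T)^{-1} X y$. Then

$$
\begin{aligned}
p(y|X; \beta, \sigma^2) \, dy
&= \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\left( -\frac{\|y - X^T \beta\|^2}{2\sigma^2} \right) dy \\
&= \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{(y_1 - \beta)^T \Sigma^{-1} (y_1 - \beta)}{2} \right) g(y_2; \sigma^2) \, dy_1 \, dy_2,
\end{aligned}
$$

where

$$\Sigma := \sigma^2 (X X^T)^{-1} \tag{21}$$

and $g(y_2; \sigma^2)$ is a density function of $y_2$ that is independent of $y_1$ and $\beta$.

When $y$ is given, $y_1$ is a sufficient statistic of $\beta$, the maximum likelihood estimator, and
the least-squares estimator of $\beta$. Thus, the regression model (19) is reduced to a Normal
model

$$p(y_1; \beta, \Sigma) = N_d(y_1; \beta, \Sigma). \tag{22}$$

Similarly, the regression model (20) for the future samples is reduced to a Normal model

$$p(\tilde{y}_1; \beta, \tilde{\Sigma}) = N_d(\tilde{y}_1; \beta, \tilde{\Sigma}) \tag{23}$$

with positive semidefinite covariance matrix. Here $\tilde{y}_1 := (\tilde{X} \tilde{X}^T)^{\dagger} \tilde{X} \tilde{y}$ and

$$\tilde{\Sigma} := \tilde{\sigma}^2 (\tilde{X} \tilde{X}^T)^{\dagger}. \tag{24}$$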
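The reduction above is a handful of linear-algebra steps. The following sketch computes $y_1$, $y_2$, $\Sigma$, $\tilde{\Sigma}$ and $\Sigma_w = (\Sigma^{-1} + \tilde{\Sigma}^{\dagger})^{-1}$ for given design matrices. The function name, the explicit `rcond` tolerance for the pseudo-inverse, and the treatment of both variances as known inputs are assumptions made for illustration.

```python
import numpy as np


def reduce_regression(X, y, X_new, sigma2, sigma2_new):
    """Reduce the regression models (19)-(20) to the Normal models (22)-(23).

    X is d x p with X X^T nonsingular; X_new is d x p_new and may be rank deficient.
    """
    G = X @ X.T                                   # d x d Gram matrix
    y1 = np.linalg.solve(G, X @ y)                # (X X^T)^{-1} X y, the least-squares estimate
    y2 = y - X.T @ y1                             # residual component
    Sigma = sigma2 * np.linalg.inv(G)             # equation (21)
    G_new = X_new @ X_new.T                       # possibly singular
    Sigma_new = sigma2_new * np.linalg.pinv(G_new, rcond=1e-10)   # equation (24)
    # The pseudo-inverse of Sigma_new is G_new / sigma2_new, so no second pinv is needed.
    Sigma_w = np.linalg.inv(G / sigma2 + G_new / sigma2_new)
    return y1, y2, Sigma, Sigma_new, Sigma_w
```

The residual `y2` is orthogonal to the rows of `X`, which is what lets the likelihood factor into a term in `y1` and a term in `y2` that carries no information about $\beta$.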

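The KL risk of the Bayesian predictive density with a prior $\pi(\beta)$ for the regression
problem becomes

$$
\begin{aligned}
R_{\mathrm{KL}}(p_\pi, \beta)
&= \int p(y|X; \beta, \sigma^2) \, D(p(\tilde{y}|\tilde{X}; \beta, \tilde{\sigma}^2) \,\|\, p_\pi(\tilde{y}|X, y, \tilde{X})) \, dy \\
&= \int p(y|X; \beta, \sigma^2) \int N_d(\tilde{y}_1; \beta, \tilde{\Sigma}) \, g(\tilde{y}_2; \tilde{\sigma}^2)
   \log \frac{N_d(\tilde{y}_1; \beta, \tilde{\Sigma}) \, g(\tilde{y}_2; \tilde{\sigma}^2)}
   {\dfrac{\int N_d(\tilde{y}_1; \beta', \tilde{\Sigma}) \, g(\tilde{y}_2; \tilde{\sigma}^2) \, N_d(y_1; \beta', \Sigma) \, g(y_2; \sigma^2) \, \pi(\beta') \, d\beta'}
   {\int N_d(y_1; \beta', \Sigma) \, g(y_2; \sigma^2) \, \pi(\beta') \, d\beta'}}
   \, d\tilde{y}_1 \, d\tilde{y}_2 \, dy \\
&= \int N_d(y_1; \beta, \Sigma) \, D(N_d(\tilde{y}_1; \beta, \tilde{\Sigma}) \,\|\, q_\pi(\tilde{y}_1|y_1)) \, dy_1 \\
&= R_{\mathrm{KL}}(q_\pi, \beta),
\end{aligned}
\tag{25}
$$

where

$$q_\pi(\tilde{y}_1|y_1) := \frac{\int N_d(\tilde{y}_1; \beta', \tilde{\Sigma}) \, N_d(y_1; \beta', \Sigma) \, \pi(\beta') \, d\beta'}{\int N_d(y_1; \beta', \Sigma) \, \pi(\beta') \, d\beta'}.$$

As a result, the prediction problem for the regression model (19) and (20) is reduced
to a prediction problem (22) and (25). Using the result in Section 2, we construct a
Bayesian prediction for the Normal regression problem.

Define $\Sigma$, $\tilde{\Sigma}$, and $\Sigma_w$ by (21), (24), and $\Sigma_w = (\Sigma^{-1} + \tilde{\Sigma}^{\dagger})^{-1}$, respectively; then the
following theorem and corollary hold.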

Theorem 4.1

Let $\pi_\Sigma(\beta) = \pi(\Sigma^{-1/2} \beta)$. Let $p(X)$ and $p(\tilde{X})$ be rotation invariant continuous functions.

(i) If $\pi$ is a non-constant rotation invariant superharmonic function, then the Bayesian
predictive density $p_{\pi_\Sigma}$ with a prior $\pi_\Sigma$ dominates $p_I$ with the uniform prior $\pi_I$ under the
risk $\mathcal{R}_{\mathrm{KL}}$.

(ii) If $\pi$ is a rotation invariant superharmonic function, then $p_{\pi_\Sigma}$ is minimax under the
KL risk $\mathcal{R}_{\mathrm{KL}}$.

Proof. If $p(X)$ and $p(\tilde{X})$ are rotation invariant, then the distributions of $\Sigma = \sigma^2 (X X^T)^{-1}$
and $\Sigma_w = (\sigma^{-2} (X X^T) + \tilde{\sigma}^{-2} (\tilde{X} \tilde{X}^T))^{-1}$ are also rotation invariant.

From Theorem 2.5 and Corollary 2.6, the theorem is derived directly. □

The assumption of rotation invariance of $p(X)$ and $p(\tilde{X})$ is sometimes not realistic.
If we consider priors depending on the future explanatory variables, we can construct a
Bayesian prediction dominating the one with the uniform prior and, therefore, being a
minimax prediction.

Define an orthogonal matrix $U$ and a diagonal matrix $\Lambda$ by a diagonalization of
$\Sigma_w^{1/2} \Sigma^{-1} \Sigma_w^{1/2}$, i.e. $\Sigma_w^{1/2} \Sigma^{-1} \Sigma_w^{1/2} = U^T \Lambda U$. Let $A^* := \Sigma_w^{1/2} U^T (\Lambda^{-1} - I_d)^{1/2}$. Then the
following theorem is a direct consequence of Theorem 3.2.

Theorem 4.2 (i) If $\pi(A^* \beta)$ is superharmonic w.r.t. $\beta$ and $\pi$ is non-constant, then the
Bayesian prediction based on the prior $\pi$ dominates that based on the uniform prior.
(ii) If $\pi(A^* \beta)$ is superharmonic, then the Bayesian prediction based on the prior $\pi$ is
minimax.

Note that $\pi(A^* \beta)$ can be superharmonic only if the number of the future samples is
more than two.

Figure 1: An example of the Bayesian prediction based on the uniform prior and a rescaled
Stein prior for the Normal regression model without an intercept term.

5 Experimental results 

We show several experimental results on the Bayesian prediction with shrinkage priors for
regression problems.

Figures 1 and 2 are examples of the regression problem. We consider the five-dimensional
Normal regression models, without an intercept term (Figure 1) and with an
intercept term (Figure 2). We set the true parameter $\beta = (1, 0, \ldots, 0) \in \mathbb{R}^5$. An explanatory
variable $X$ is sampled from the uniform distribution $U([-1, 1]^{5 \times 10})$ and the corresponding
target variable $y$ is sampled from $N_{10}(X^T \beta, I_{10})$. The target variable $\tilde{y}$ for each explanatory
variable $\tilde{x} = (\tilde{x}_1, 0, \ldots, 0)$ where $\tilde{x}_1 \in [0, 2]$ is predicted by the Bayesian predictive
density based on the uniform prior $\pi_I$ and that based on a rescaled Stein prior $\pi_{S;\Sigma}$, where
$\Sigma = X X^T$.

The two lines in Figures 1 and 2 are $\tilde{y} = \hat{\beta}_\pi^T \tilde{x}$ for $\pi_I$ and $\pi_{S;\Sigma}$, respectively, where $\hat{\beta}_\pi$ is
the posterior mean with prior $\pi$. In both figures, the slope of the line with the rescaled Stein
prior is smaller than the one with the uniform prior because the slope parameter $\beta$ is
shrunk toward $\beta = 0$. Moreover, in Figure 2 the intercept parameter is also shrunk.
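The posterior mean under a rescaled Stein prior has no elementary closed form in general, but it can be approximated by self-normalized importance sampling: the posterior is proportional to $N(\mu; y, \Sigma)\,\pi(\mu)$, so drawing $\mu \sim N(y, \Sigma)$ and weighting by $\pi(\mu)$ gives a consistent estimate. This sampler is an illustrative assumption and is not claimed to be the procedure behind the paper's figures; the name `stein_posterior_mean`, the sample size and the seed are hypothetical choices.

```python
import numpy as np


def stein_posterior_mean(y, Sigma, n_samples=200_000, seed=0):
    """Monte Carlo posterior mean under pi(mu) = ||Sigma^{-1/2} mu||^{-(d-2)}."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    d = y.size
    L = np.linalg.cholesky(Sigma)                          # Sigma = L L^T
    mu = y + rng.standard_normal((n_samples, d)) @ L.T     # draws from N(y, Sigma)
    z = np.linalg.solve(L, mu.T).T                         # ||z|| equals ||Sigma^{-1/2} mu||
    weights = np.linalg.norm(z, axis=1) ** (-(d - 2))      # prior density as importance weight
    return (weights[:, None] * mu).sum(axis=0) / weights.sum()
```

The weights have infinite variance once $d \geq 4$, so convergence can be slow in higher dimensions; for the five-dimensional examples a larger sample or a different integration scheme may be preferable.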

Figure 3 shows the distribution functions of the predictive density $p_I(\tilde{y}|\tilde{x}, y, X)$ with
$\pi_I$ and $p_{S;\Sigma}(\tilde{y}|\tilde{x}, y, X)$ with $\pi_{S;\Sigma}$, respectively, for $\beta = \tilde{x} = e_1 := (1, 0, \ldots, 0) \in \mathbb{R}^d$.

Figure 2: An example of the Bayesian prediction based on the uniform prior and a rescaled
Stein prior for the Normal regression model with an intercept term $\beta_0 = 1$.



Next, we show an example of Bayesian prediction whose prior depends on the explanatory
variables of future samples. We set $x_1 = (\sqrt{3}/2, 1/2, 0)^T$, $x_2 = (\sqrt{3}/2, -1/2, 0)^T$,
$x_3 = (0, 0, 1)^T$, $y_1 = \sqrt{3}/2 + 1/2$, $y_2 = \sqrt{3}/2 - 1/2$ and $y_3 = 0$. Figure 5 is a graph of
$E_{\pi_{S;A^*}}[\tilde{y}|\tilde{x}, y, X]$ for each value of $\tilde{x} = (\tilde{x}^{(1)}, \tilde{x}^{(2)}, 0) \in \mathbb{R} \times \mathbb{R} \times \{0\}$ with the rescaled Stein
prior $\pi_{S;A^*}$. Here, the Bayesian estimation based on the uniform prior corresponds to the
MLE $\hat{\beta} = (1, 1, 0)$, i.e. $\tilde{y} = \tilde{x}^{(1)} + \tilde{x}^{(2)}$.

We can see that the amount of shrinkage by the Bayesian prediction increases as the
direction of $\tilde{x}$ becomes closer to $x^{(1)}$ than $x^{(2)}$, i.e. $\tilde{x}^T e_1$ becomes larger than $\tilde{x}^T e_2$. This
fact is intuitively explained as follows: when explanatory variables of training samples
are closer to $x^{(1)}$, $\tilde{x}$ whose direction is close to $x^{(1)}$ has more information than ones whose
direction is close to $x^{(2)}$. Thus $\tilde{x}$ close to $x^{(1)}$ need not be shrunk.

Figure 4 shows the risk functions of $p_I$ and $p_{S;\Sigma}$ for $d = 3, 5, 7, 9$ and $\|\beta\| \in [0, 2]$.
The model has no intercept term. Here we assume that the columns of $X$ and $\tilde{X}$ are
independently sampled from $N_{10}(0, I_{10})$.

Figure 6 compares five predictive densities: the Bayesian predictive densities based on
$\pi_I$ and $\pi_{S;\Sigma}$, the ridge regression prior with regularization parameters $\lambda \in \{\sqrt{10}, 10\}$, and
the plug-in density of the MLE.

The ridge regression prior is

$$\pi_{RR}(\beta; \lambda) \propto \exp\left( -\frac{\lambda \|\beta\|^2}{2} \right)$$

with a regularization parameter $\lambda > 0$. We note that the posterior mean with the ridge
regression prior is equivalent to the ridge regression estimator

$$\hat{\beta}_{RR} = (X X^T + \lambda I)^{-1} X y.$$
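The ridge estimator above is a single linear solve. The sketch below follows the paper's convention that $X$ is $d \times p$ with one observation per column; the function name is an illustrative choice, and unit noise variance is assumed so that the estimator coincides with the posterior mean under the Gaussian prior.

```python
import numpy as np


def ridge_estimator(X, y, lam):
    """Ridge regression estimate (X X^T + lam * I)^{-1} X y for a d x p design X."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```

Setting `lam = 0` recovers the least-squares estimate $y_1$ whenever $X X^T$ is nonsingular, and larger values of `lam` shrink the estimate toward the origin.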

When $\|\beta\|$ is close to 0, the center of shrinkage, the risk based on the ridge regression
prior $\pi_{RR}$ becomes smaller than that based on $\pi_{S;\Sigma}$. However, when $\|\beta\|$ increases, the
prediction with $\pi_{RR}$ becomes worse than the one with $\pi_I$ and even worse than the plug-in
distribution of the MLE.

Figure 3: Distribution functions of $p_I(\tilde{y}|\tilde{x}, y, X)$ and $p_{S;\Sigma}(\tilde{y}|\tilde{x}, y, X)$ where $\beta = \tilde{x} = e_1 :=
(1, 0, \ldots, 0) \in \mathbb{R}^d$, $X$ is a sample from $U([-1, 1]^{d \times p})$, $y$ is a sample from $N_p(y; X^T \beta, 10 I_p)$,
and $p(\tilde{y}|\tilde{x}) = N(\tilde{y}; \tilde{x}^T \beta, 10)$. We generate $10^4$ samples of $\tilde{y}$ from each predictive
distribution. Sample means of $p_I$ and $p_{S;\Sigma}$ are 1.3134 and 0.6898, respectively.

6 Conclusions and discussions 

In this paper, we considered the multivariate Normal model with an unknown mean and
a known covariance. The covariance matrix can change after the first sampling. We
assumed rotation invariant priors of the covariance matrix and the future covariance
matrix. We showed that the shrinkage predictive density with the rescaled rotation invariant
superharmonic priors is minimax under the Kullback-Leibler risk. Moreover, if the prior
is not constant, the Bayesian predictive density based on the prior dominates the one with
the uniform prior.

In this case, the rescaled priors are independent of the covariance matrix of future
samples. Therefore, we can calculate the posterior distribution and the mean of the
predictive distribution (i.e. the posterior mean and the Bayesian estimate for quadratic
loss) based on some of the rescaled Stein priors without knowledge of the future covariance.
Since the predictive density with the uniform prior is minimax, the one with each rescaled
Stein prior is also minimax.

Next we considered Bayesian predictions whose prior can depend on the future
covariance. In this case, we proved that the Bayesian prediction based on a rescaled
superharmonic prior dominates the one with the uniform prior without assuming
rotation invariance.

Applying these results to the prediction of response variables in the Normal regression
model, we showed that there exists a prior distribution such that the corresponding
Bayesian predictive density dominates that based on the uniform prior. Since the prior
distribution is independent of future explanatory variables, both the posterior distribution
and the mean of the predictive distribution are independent of the future explanatory
variables.

Figure 4: The risk difference of $p_I$ and $p_{S;\Sigma}$ for $d = 3, 5, 7, 9$ and $\|\beta\| \in [0, 2]$. We generate
$10^4$ independent samples of $X$ and $\tilde{X}$ from $N_{10}(0, I_{10})$. Each line in the figure represents
the sample mean of the risk difference $R_{\mathrm{KL}}(\beta, p_I) - R_{\mathrm{KL}}(\beta, p_{S;\Sigma})$. Each error bar represents
the standard deviation.

The robustness of some shrinkage methods such as Stein estimators has been studied (see,
for example, the bibliography in Robert (2001)). The Stein effect has robustness in
the sense that it depends on the loss function rather than the true distribution of the
observations. Our result shows that the Stein effect has robustness with respect to the
covariance of the true distribution of the future observations.

As the dimension of the model becomes large, the risk improvement by the shrinkage
with the rescaled Stein prior $\pi_{S;\Sigma}$ increases, as in Figure 4. An important example of
a high dimensional model is the kernel methods (see Hastie et al. (2001)). As noted
in Cristianini & Shawe-Taylor, the feature space of kernel methods is a reproducing
kernel Hilbert space whose dimension is as large as the sample size. Therefore
Bayesian prediction based on shrinkage priors could be efficient for kernel methods. This
is a future problem.

Figure 5: An example of Bayesian prediction whose prior depends on the explanatory
variables of future samples.



A Finiteness of the marginal distribution 

Here, we prove finiteness of the marginal distribution $m_\pi(x, \Sigma)$.

Lemma A.1 If $\pi$ is a superharmonic prior density function, the marginal distribution
$m_\pi(x, \Sigma)$ is finite for every vector $x \in \mathbb{R}^d$ and positive definite matrix $\Sigma \in \mathbb{R}^{d \times d}$.

Proof. Fix a vector $x \in \mathbb{R}^d$. From the definition of superharmonic functions, $\pi \not\equiv \infty$.
Thus, $\exists x_0 \in \mathbb{R}^d$ s.t. $\pi(x_0) < \infty$. If we set $\tilde{\pi}(\mu) := \pi(\mu + x_0)$, then $\tilde{\pi}$ is superharmonic
and $\tilde{\pi}(0) < \infty$.

Let $\lambda_{\max}$ be the maximal eigenvalue of $\Sigma$ and $r_0 := \|x - x_0\|$, then

$$
\begin{aligned}
m_\pi(x, \Sigma)
&\leq \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \int \exp\left( -\frac{\|\mu - x + x_0\|^2}{2\lambda_{\max}} \right) \tilde{\pi}(\mu) \, d\mu \\
&\leq \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \left\{ \int_{\|\mu\| \leq 2r_0} \exp\left( -\frac{\|\mu - x + x_0\|^2}{2\lambda_{\max}} \right) \tilde{\pi}(\mu) \, d\mu
 + \int_{\|\mu\| > 2r_0} \exp\left( -\frac{\|\mu\|^2}{8\lambda_{\max}} \right) \tilde{\pi}(\mu) \, d\mu \right\}.
\end{aligned}
\tag{26}
$$

The first term of the right-hand side of (26) is finite because the integral of a superharmonic
function over a compact subset of $\mathbb{R}^d$ is finite (see Theorem 4.10 of Helms
(1969)).

The second term is also finite because

$$
\sum_{n=2}^{\infty} \int_{n r_0 \leq \|\mu\| < (n+1) r_0} \exp\left( -\frac{\|\mu\|^2}{8\lambda_{\max}} \right) \tilde{\pi}(\mu) \, d\mu
\leq C \sum_{n=2}^{\infty} \exp\left( -\frac{n^2 r_0^2}{8\lambda_{\max}} \right) \tilde{\pi}(0) \{(n+1) r_0\}^d
$$

for a positive constant $C$. Here we used the fact $\int_{\|\mu\| \leq r} \tilde{\pi}(\mu) \, d\mu \leq C \tilde{\pi}(0) r^d$ by Theorem 4.9
of Helms (1969). Therefore, $m_\pi(x, \Sigma) < \infty$. □

From this lemma, we see that the assumption $m_\pi(z; v I_d) < \infty$ in Theorem 2.4 (ii) is
redundant.

Figure 6: Comparison of the risk values by five predictive densities: the Bayesian predictive
density based on $p_I$ and $p_S$, the ridge regression prior with regularization parameters
$\lambda = 10$ and $\lambda = \sqrt{10} = 3.16$, and the plug-in density of the MLE. The model is
five-dimensional and has no intercept term. We generate $10^4$ independent samples of $X$ and $\tilde{X}$
from $N_{10}(0, I_{10})$. Each line in the figure represents the sample mean of the risk $R_{\mathrm{KL}}(\beta, \hat{p})$
for the predictive density $\hat{p}$.